Comment on Yu et al., "High Quality Binary Protein Interaction Map of the Yeast 
Interactome Network." Science 322, 104 (2008). 

A. Clauset 

Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM, 87501, USA 

We test the claim by Yu et al., presented in Science 322, 104 (2008), that the degree distribution 
of the yeast (Saccharomyces cerevisiae) protein-interaction network is best approximated by a power 
law. Yu et al. consider three versions of this network. In all three cases, however, we find the most 
likely power-law model of the data is distinct from and incompatible with the one given by Yu et al. 
Only one network admits good statistical support for any power law, and in that case, the power 
law explains only the distribution of the upper 10% of node degrees. These results imply that there 
is considerably more structure present in the yeast interactome than suggested by Yu et al., and 
that these networks should probably not be called "scale free." 



Protein-interaction networks, where nodes are natu- 
ral proteins and links represent non-trivial binding affin- 
ity between two proteins, hold great promise for push- 
ing forward our understanding of cellular processes. The 
wide interest in these interactome networks stems mainly 
from the observation that while modern genetic meth- 
ods allow us to identify which genes code for proteins, 
the set of these genes only amounts to a "parts list" of 
a cell. A more complete understanding of cellular pro- 
cesses requires knowing something about the functional 
roles and dynamic interactions of these parts 0. Thus, 
by considering the patterns of protein interactions, i.e., 
their network, we can pose and answer meaningful bio- 
logical questions about complex cellular processes, and 
their evolution. 

Determining the protein-interaction network for a par- 
ticular species is a highly non-trivial task, and relies 
upon sophisticated molecular techniques to both build 
the proteins, and to test their pairwise interactions. The 
most direct approach would be to individually test each 
of the n^2 possible interactions for n proteins. But, because 
n is typically on the order of thousands or tens 
of thousands, and high-throughput methods are not yet 
available for these tests, this approach is not used. In- 
stead, researchers use techniques that test multiple inter- 
actions at once 0, i], e.g., the yeast two-hybrid (Y2H) 
screen [5]. To date, there have been a number of high-profile 
efforts to construct the interactome for yeast 
(S. cerevisiae) [1, 0, i, i, and, many would argue, 
a lot of real progress. 

However, these methods also have serious limitations, 
and have been shown capable of producing high false-positive 
and high false-negative rates [111, • Some of 
the techniques also exhibit severe biases, being unable to 
test for interactions involving entire classes of proteins. 
The Y2H assay, for instance, is not suitable for tran- 
scriptional activators or membrane-bound proteins. As 
a result of these limitations, some scientists have wondered 
privately how much real biology our current maps 
actually capture. 

The Yu et al. paper attempts to address some of 
these shortcomings using a new high-throughput Y2H 
screen. Combining its results with those of previous 
studies, they produce a new set of interactions (denoted 
Y2H-union), which they suggest covers about 20% of 
the entire network. In the interest of completeness, 
Yu et al. include in their subsequent network analyses 
two alternative versions of the yeast interactome: one is 
based on co-complex information drawn from raw high- 
throughput coaffinity purification and mass spectrometry 
data (denoted Combined-AP/MS), and one is based on 
a smaller set of literature-curated interactions (denoted 
LC-multiple) that are assumed to be error free. 

This Comment is not concerned with the quality of 
the laboratory procedures or the accuracy of the inferred 
interaction data. Rather, the focus is relatively narrow, 
concerning only the analysis of these networks' degree 
distributions and the conclusions drawn thereby. In par- 
ticular, Yu et al. claim that the degree distributions of 
all three networks follow power-law distributions (with 
parameters given in Table 1): 

As found previously for other macromolecular 
networks, the connectivity or "degree" distribution 
of all three data sets is best approximated 
by a power-law. [1] 

The implication is thus that the yeast proteome can 
be considered "scale free", with all that goes along with 
that label [13]. However, this claim depends on a 
statistical method, specifically linear regression, that is 
known to produce biased and incorrect results in this 
context. By reanalyzing Yu et al.'s data [13] with 
appropriate statistical tools [15], we show that (i) the most 
likely power-law models of these networks' degree 
distributions are distinct from and incompatible with those 
quoted by Yu et al., (ii) at best, only the 10.3% most 
connected nodes in the Y2H-union network are plausibly 
power-law distributed, and (iii) there is considerably 
more structure present in all of these networks than 
suggested by Yu et al. As a result, these networks should 
probably not be considered "scale free." 

FIG. 1: Three approaches to fitting a power-law distribution to the Y2H-union data from Yu et al. [1]. (a) The parameterization 
given by Yu et al., derived using standard linear regression on the log-transformed histogram of degree frequencies. (b) A fit 
derived via maximum likelihood over the entire range of the data, i.e., x_min = 1. (c) A fit derived via maximum likelihood for 
estimating α, and by selecting the x_min that gives the best power-law fit to the upper range of the data. In both (a) and (b), 
the fitted power law is not statistically significant (p = 0.00 ± 0.01), indicating that large or systematic deviations from the 
power-law hypothesis exist. In (c), the upper 10.3% are plausibly power-law distributed (p = 0.95 ± 0.03). 

We begin by reanalyzing the Y2H-union data, argued 
by Yu et al. to be the most accurate map among the 
three. Figures 1a-c show the data along with three 
different power-law models. The first panel shows the model 
suggested by Yu et al. (with scaling exponent α = 2.4), 
which was derived using a linear regression approach; this 
model yields a poor fit to the lower-to-middle range of 
the data. The second panel shows the most likely 
power-law model over the entire range of data, which fits the 
lower range and thus the majority of the data relatively 
well, but yields a poor fit to the upper range. The third 
panel shows the most likely power-law model for the 
upper range alone. 

Notably, Figures 1a-c plot the data as a complementary 
cumulative distribution function (CDF). If the data 
were indeed power-law distributed, this function would 
be straight on the log-log axes, but the reverse is not true. 
Being straight on log-log axes is not a sufficient condition 
for some data to be power-law distributed; many kinds 
of non-power-law distributed data can look straight on 
log-log axes. To decide whether some data do or do not 
follow a power-law distribution, we must use statistical 
tools that can tell the difference [l^. Linear regression 
is not one of these. 
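
To see why a power law appears straight on these axes, it may help to write the relationship out. The lines below use the continuous approximation with density p(x) for x ≥ x_min and scaling exponent α > 1; this is an assumption made here for simplicity, since the degree data themselves are discrete.

```latex
\begin{align}
p(x) &= \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha},
  \qquad x \ge x_{\min}, \\
\Pr(X \ge x) &= \int_x^{\infty} p(t)\, dt
  = \left( \frac{x}{x_{\min}} \right)^{-(\alpha - 1)}, \\
\log \Pr(X \ge x) &= -(\alpha - 1) \log x + (\alpha - 1) \log x_{\min}.
\end{align}
```

The last line is linear in log x with slope -(α - 1). The converse fails because other heavy-tailed distributions bend only gently over a finite observed range, so straightness alone cannot settle the question.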

A straightforward test for whether some data can 
reasonably be claimed to follow a power-law distribution is 
a significance test [l^, [13]. When we fit the power-law 
model to the same data that we use to score the power 
law's plausibility, we induce a correlation between the 
data and the model; however, this correlation can be 
controlled using a Monte Carlo procedure [15]. The result of 
the test is a single value p that represents the plausibility 
of the fitted model as an explanation of the data: small 
p-values, conventionally p < 0.1, indicate that the data 
cannot be considered to follow the fitted model, while 
larger values indicate only that the fitted model is 
plausible (not that it is correct) [18]. Conducting such a test 
with the Y2H-union data and the power law claimed by 
Yu et al. yields p = 0.00 ± 0.01, indicating that this power 
law is a terrible model of the data, and that the data 
deviate in large or systematic ways from the proposed power 
law. 
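
The Monte Carlo idea can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used for the results reported here: it assumes a continuous power law with a known, fixed x_min, and it uses the Kolmogorov-Smirnov distance as the measure of fit. The function names (fit_alpha, ks_distance, sample_power_law, plausibility) are hypothetical.

```python
import math
import random

def fit_alpha(data, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / xmin))."""
    return 1.0 + len(data) / sum(math.log(x / xmin) for x in data)

def ks_distance(data, xmin, alpha):
    """Kolmogorov-Smirnov distance between the data and the fitted CDF."""
    xs = sorted(data)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst

def sample_power_law(n, xmin, alpha, rng):
    """Inverse-transform samples from the continuous power law."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def plausibility(data, xmin, trials=200, seed=0):
    """Fraction of synthetic data sets that fit at least as badly as the data."""
    rng = random.Random(seed)
    alpha = fit_alpha(data, xmin)
    observed = ks_distance(data, xmin, alpha)
    worse = 0
    for _ in range(trials):
        synth = sample_power_law(len(data), xmin, alpha, rng)
        # Refit each synthetic set so it is treated exactly like the real data;
        # this is what controls for the fit-then-test correlation.
        synth_alpha = fit_alpha(synth, xmin)
        if ks_distance(synth, xmin, synth_alpha) >= observed:
            worse += 1
    return worse / trials
```

Calling plausibility(data, xmin) returns an estimate of p. The resolution of the estimate is limited by the number of trials, so a quoted uncertainty such as ± 0.01 requires on the order of thousands of synthetic data sets rather than the 200 used as the default here.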

The poor fit here is partly due to the way the power law 
was estimated from the data: linear regression for fitting 
the power law is known to produce biased results [19], 
mainly by dramatically overweighting the large but rare 
events. Further, the canonical R^2 value, used to judge the 
quality of the regression, can easily be high even when 
the data are not power-law distributed p^. In short, 
regression applied in this way makes assumptions about the 
model that are incompatible with the hypothesis being 
tested, and thus fails in uncontrolled ways. 
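
A small synthetic experiment illustrates the point about R^2. The sketch below draws data from a log-normal distribution, which is not a power law by construction, and fits a straight line to the upper half of its complementary CDF on log-log axes. The choice of distribution, its parameters, and the restriction to the upper half of the sample are all assumptions made for illustration; none of this uses the interactome data.

```python
import math
import random

def tail_ccdf(data):
    """Upper half of the sample as (x, Pr(X >= x)) pairs."""
    xs = sorted(data)
    n = len(xs)
    return [(xs[i], (n - i) / n) for i in range(n // 2, n)]

def loglog_r_squared(points):
    """R^2 of a least-squares line through (log x, log Pr(X >= x))."""
    lx = [math.log(x) for x, _ in points]
    ly = [math.log(p) for _, p in points]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    syy = sum((b - my) ** 2 for b in ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy * sxy / (sxx * syy)

rng = random.Random(0)
# Log-normal sample: heavy-tailed, but not a power law.
lognormal = [rng.lognormvariate(0.0, 2.0) for _ in range(2000)]
r2 = loglog_r_squared(tail_ccdf(lognormal))
print(f"R^2 of a straight-line fit to log-normal data: {r2:.2f}")
```

A value of R^2 close to 1 here would say nothing about whether the data follow a power law, since the generating distribution is known not to be one.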

A more reliable method for fitting a power-law 
distribution to data is the method of maximum 
likelihood [1^, [l^]. Figure 1b shows a power-law model 
fitted to the entire range of data, while Figure 1c shows 
a power law fitted only to the upper range. The second 
of these requires choosing a point x_min where the 
power-law behavior starts, but this can be done using 
appropriate tools [15]. Applying the significance test to 
both models, we find that even the most likely power 
law is still a terrible explanation of the entire range of 
data (p = 0.00 ± 0.01), but that it is an entirely 
plausible explanation (p = 0.95 ± 0.03) of the 209 (10.3%) 
most connected nodes, i.e., those with degree x ≥ 6. A 
corollary, however, is that the distribution of the degrees 
1 ≤ x < 6 is not a power law. (One possibility is that the 
low-degree data are power-law distributed in a way 
different from that of the high-degree nodes. Fortunately, the 
same tools we have already used can be readily adapted 
to test this hypothesis.) 
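
For integer-valued data such as node degrees, one common way to carry out both steps is sketched below: the exponent is fitted by maximizing the discrete power-law likelihood, and x_min is chosen as the value that minimizes the Kolmogorov-Smirnov distance between the data and the fitted model. Treating this as the meaning of "appropriate tools" is an assumption, and the helper names (hurwitz_zeta, fit_alpha, ks_distance, fit_power_law), the search interval for α, and the min_tail cutoff are all hypothetical choices made for illustration.

```python
import math

def hurwitz_zeta(alpha, q, terms=200):
    """Approximate sum_{k>=0} (k + q)**(-alpha) for alpha > 1 and q > 0."""
    head = sum((k + q) ** -alpha for k in range(terms))
    a = q + terms
    # Euler-Maclaurin estimate of the remaining tail of the sum.
    tail = (a ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * a ** -alpha
            + alpha * a ** (-alpha - 1.0) / 12.0)
    return head + tail

def fit_alpha(tail, xmin):
    """Maximum-likelihood exponent of a discrete power law on x >= xmin."""
    n = len(tail)
    sum_log = sum(math.log(x) for x in tail)

    def neg_loglik(alpha):
        return n * math.log(hurwitz_zeta(alpha, xmin)) + alpha * sum_log

    lo, hi = 1.05, 6.0  # assumed search interval for alpha
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(40):  # golden-section search on a unimodal objective
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if neg_loglik(c) < neg_loglik(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)

def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical and fitted CDFs on x >= xmin."""
    n = len(tail)
    norm = hurwitz_zeta(alpha, xmin)
    counts = {}
    for x in tail:
        counts[x] = counts.get(x, 0) + 1
    below, worst = 0, 0.0
    for x in sorted(counts):
        # Compare just below x, then at x, so gaps between observed values count.
        worst = max(worst, abs(below / n - (1.0 - hurwitz_zeta(alpha, x) / norm)))
        below += counts[x]
        worst = max(worst, abs(below / n - (1.0 - hurwitz_zeta(alpha, x + 1) / norm)))
    return worst

def fit_power_law(data, min_tail=50):
    """Return (xmin, alpha, ks) for the xmin with the smallest KS distance."""
    best = None
    for xmin in sorted(set(data)):
        tail = [x for x in data if x >= xmin]
        if len(tail) < min_tail:
            break  # too few points left for a stable fit
        alpha = fit_alpha(tail, xmin)
        dist = ks_distance(tail, xmin, alpha)
        if best is None or dist < best[2]:
            best = (xmin, alpha, dist)
    return best
```

Given a list of positive integer degrees, fit_power_law(degrees) returns the selected x_min, the fitted exponent, and the corresponding distance, or None when fewer than min_tail observations are available. A fit obtained this way says nothing on its own about plausibility; it still needs the significance test described above.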

Figure 2 repeats the analysis of Figure 1 with the data 
for the alternative interactome maps. For these, we find 
that the Combined-AP/MS network's degree distribution 
cannot be considered to follow a power law (p = 0.01 ± 
0.03), while the support for the LC-multiple network is 
marginally significant (p = 0.15 ± 0.03). In both cases, 
the best power-law models cover only the upper range of 
data, and power laws that cover the entire range have 
zero support (p = 0.00 ± 0.03). Table 1 summarizes the 
results of our reanalysis of the three networks. 

FIG. 2: The Combined-AP/MS and LC-multiple data sets, shown as complementary cumulative distribution functions 
Pr(X > x) on log-log axes, along with fits of the power-law hypothesis using the same methods as in Fig. 1. 

TABLE 1: Power-law models for the three versions of the yeast interactome. In the first few columns, we quote the models 
given by Yu et al., which were derived using regression methods. In the second set, we give the best fits derived by maximum 
likelihood when the range of fit is allowed to vary (as in [1^1), along with the corresponding p-value. In the final column, we 
state the support for the conjecture that the corresponding data follows a power-law distribution. In every case, the maximum 
likelihood power law is distinct and incompatible with that given by Yu et al., and only for the upper 10.3% of the Y2H-union 
data does the power-law hypothesis have strong statistical support. 

As a brief aside, Yu et al. also conducted a 
model-comparison exercise to test whether a power-law 
distribution with or without an exponential cutoff was a 
better explanation of the data. For the same reasons given 
above, the regressions and R^2 values Yu et al. use cannot 
reliably determine which of these models is better. 
Fortunately, reliable tools for answering such a question do 
exist, e.g., a likelihood ratio test [Hjm, although we do 
not apply them here. 
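
As a rough illustration of how a likelihood ratio test compares two candidate models, the sketch below pits a continuous power law against a shifted exponential distribution, each fitted by maximum likelihood above a fixed x_min. The exponential is swapped in here as the competitor because it has a closed-form fit; it is not the power law with exponential cutoff considered by Yu et al., and this sketch is not applied to their data. The significance of the sign is judged with a normal approximation, and the function name loglik_ratio_test is hypothetical.

```python
import math
import random

def loglik_ratio_test(data, xmin):
    """Compare a continuous power law with a shifted exponential on x >= xmin.

    Returns (total, p_value): total > 0 favors the power law, total < 0 the
    exponential, and a small p_value means the sign is unlikely to be noise.
    """
    n = len(data)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in data)  # power-law MLE
    rate = n / sum(x - xmin for x in data)                   # exponential MLE
    diffs = []
    for x in data:
        log_pl = math.log((alpha - 1.0) / xmin) - alpha * math.log(x / xmin)
        log_exp = math.log(rate) - rate * (x - xmin)
        diffs.append(log_pl - log_exp)
    total = sum(diffs)
    mean = total / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    # Two-sided p-value for the sign of `total` under a normal approximation.
    p_value = math.erfc(abs(total) / (sd * math.sqrt(2.0 * n)))
    return total, p_value
```

When the p-value is large, the data do not discriminate between the two models and neither should be declared the better one. Nested comparisons, such as a power law with and without a cutoff, call for a different reference distribution than the normal approximation used in this sketch.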

With these results in hand, we can now draw several 
new conclusions about the structure of protein 
interactions in yeast, and generally clarify the results of Yu et al. 
First, the question of whether the yeast interactome's 
degree distribution is well-characterized by a power-law 
distribution is not yet settled. Yu et al. argue that the 
Y2H-union network is the most accurate map to date (more 
accurate than the Combined-AP/MS or LC-multiple 
versions), but here the statistics only support the notion 
that a power law is a plausible model of the degrees of 
a small fraction of the entire network (10.3%, or the 209 
most connected nodes). The LC-multiple network was 
presented as being a smaller, but generally high-quality, 
data set, and here there is only marginal statistical 
support for a power-law degree distribution in the upper 
range. Notably, the two power laws for these networks 
are largely incompatible, with α_Y2H = 2.9 ± 0.2 versus 
α_LC = 3.3 ± 0.2. If the Y2H-union network were merely 
a more complete version of the LC-multiple network, a 
greater degree of overlap in these estimates would be 
expected. The implication is that there are significant 
differences between these networks that cannot be explained 
away by simple sampling arguments. 

Further, the fact that the best power-law model of the 
Y2H-union data only explains the distribution of the up- 
per 10.3% of the node degrees implies that there is con- 
siderable structure in this network that remains to be 
explained. This structure may have evolutionary or func- 
tional significance, especially considering that its behav- 
ior is qualitatively different from that of the large-degree 
nodes and that it accounts for almost 90% of the 
network. Additionally, the existence of the cross-over point 
at x_min = 6 from non-power-law to power-law behavior 
deserves a scientific explanation. In the additional 
structural analyses performed by Yu et al., the x_min value 
could serve as a principled threshold by which to 
quantitatively define "high degree" (see for instance [23]). 

In closing, we note that there are many aspects of the 
Yu et al. study that seem entirely reasonable, and much 
of the paper concerns the experimental work done to construct 
the Y2H-union version of the yeast interactome. 
From this perspective, the paper pushes the field forward 
in a meaningful way, and the problems discussed here are 
a small part of a very large project. 

On the other hand, the goal of constructing and analyz- 
ing the yeast interactome is to ultimately understand the 
mechanisms that create the observed patterns of interac- 
tions, and their implications for higher cellular functions. 
Scientific progress on these questions certainly depends 
on high-quality experimental work, but it also depends on 
high-quality statistical work: to get the theories right, we 
must also get the statistics right. Otherwise, we cannot 
know for sure what the data do and do not say. For test- 
ing whether some data do or do not follow a power-law 
distribution, reliably accurate tools now exist, and their 
application can shed considerable light on the relevant 
scientific questions. 